18 research outputs found

    Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

    Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art. Comment: Technical Report
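
    As a rough illustration of the idea above (a sketch, not the paper's exact formulation), the snippet below builds a first-order transition-probability task graph from mined keystep sequences and uses it as a prior to re-score per-clip keystep predictions; the function names, the first-order Markov simplification, and the mixing weight are assumptions introduced here.

```python
import numpy as np

def build_task_graph(keystep_sequences, num_keysteps, smoothing=1e-3):
    """Estimate keystep transition probabilities from mined training sequences.

    keystep_sequences: list of lists of keystep indices observed in how-to videos.
    Returns a (K, K) row-stochastic matrix P with P[i, j] ~ P(next = j | current = i).
    """
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in keystep_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def regularize_predictions(clip_probs, transition, alpha=0.5):
    """Re-score per-clip keystep posteriors (T, K) with the task-graph prior."""
    scores = np.empty_like(clip_probs)
    scores[0] = clip_probs[0]
    for t in range(1, len(clip_probs)):
        prior = scores[t - 1] @ transition          # graph-implied next-keystep prior
        mixed = (1 - alpha) * clip_probs[t] + alpha * prior
        scores[t] = mixed / mixed.sum()             # renormalize to a distribution
    return scores
```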

    NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

    Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state of the art for NLQ, we also demonstrate unique properties of our approach, such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories. Code and models: http://vision.cs.utexas.edu/projects/naq. Comment: 13 pages, 7 figures, appearing in CVPR 2023
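
    A minimal sketch of the narrations-as-queries idea follows, assuming Ego4D-style timestamped narrations: each narration becomes a free-form query whose answer is a short window around its timestamp. The fixed window size, the "#C C" rewriting rule, and the NLQExample container are illustrative simplifications, not the paper's exact construction.

```python
from dataclasses import dataclass

@dataclass
class NLQExample:
    video_id: str
    query: str       # free-form text query
    start_s: float   # target temporal window (seconds)
    end_s: float

def narrations_to_queries(narrations, window_s=10.0):
    """Turn (video_id, timestamp_s, text) narrations into NLQ-style supervision."""
    examples = []
    for video_id, ts, text in narrations:
        # e.g. "#C C opens the drawer" -> "the camera wearer opens the drawer"
        query = text.replace("#C C", "the camera wearer").strip()
        half = window_s / 2.0
        examples.append(NLQExample(video_id, query, max(0.0, ts - half), ts + half))
    return examples
```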

    SpotEM: Efficient Video Search for Episodic Memory

    The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach that achieves efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search, conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10%-25% of the clip features, we preserve 84%-97% of the original EM model's accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem Comment: Published in ICML 2023
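
    The select-then-search structure can be sketched as follows. This is a stand-in under stated assumptions: SpotEM learns its clip selector jointly with the EM model via distillation losses, whereas here a plain dot product between cheap indexing features and the query embedding is used only to show where the compute savings come from; the names spot_then_search, keep_frac, and em_model are hypothetical.

```python
import torch

def spot_then_search(cheap_feats, query_emb, em_model, clips, keep_frac=0.15):
    """Two-stage search: cheap selection first, expensive EM inference second.

    cheap_feats: (T, D) low-cost per-clip semantic features (rooms, objects, interactions)
    query_emb:   (D,) embedding of the natural-language query
    em_model:    callable running the full EM localizer on the selected clips
    """
    # Score each clip by similarity between its cheap features and the query.
    scores = torch.matmul(cheap_feats, query_emb)            # (T,)
    k = max(1, int(keep_frac * len(clips)))
    keep = torch.topk(scores, k).indices.sort().values       # keep temporal order
    selected = [clips[i] for i in keep.tolist()]
    # Only the selected fraction of clips pays for expensive feature extraction.
    return em_model(selected, keep)
```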

    EgoEnv: Human-centric environment representations from egocentric video

    First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge. Project page: https://vision.cs.utexas.edu/projects/ego-env/ Comment: Published in NeurIPS 2023 (Oral)
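
    One way to picture the training signal, assuming a PyTorch setup: an encoder aggregates egocentric clip features into an environment embedding trained, in simulation where the surroundings are fully observable, to predict which categories lie around the camera wearer. The GRU backbone, the multi-label target, and the loss below are placeholders rather than the paper's model.

```python
import torch
import torch.nn as nn

class EnvAwareEncoder(nn.Module):
    """Toy encoder: clip features -> environment embedding -> local-surroundings prediction."""

    def __init__(self, clip_dim=512, env_dim=256, num_env_labels=40):
        super().__init__()
        self.encoder = nn.GRU(clip_dim, env_dim, batch_first=True)
        self.head = nn.Linear(env_dim, num_env_labels)  # e.g. nearby object/room categories

    def forward(self, clip_feats):                      # clip_feats: (B, T, clip_dim)
        _, h = self.encoder(clip_feats)
        env_emb = h[-1]                                 # (B, env_dim) environment state
        return env_emb, self.head(env_emb)

def env_loss(logits, env_labels):
    """env_labels: (B, num_env_labels) multi-hot floats, available only in simulation."""
    return nn.functional.binary_cross_entropy_with_logits(logits, env_labels)
```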

    Habitat-Matterport 3D Semantics Dataset

    We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior datasets. A key difference setting HM3DSEM apart from other datasets is the use of texture information to annotate pixel-accurate object boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object Goal Navigation task using different methods. Policies trained using HM3DSEM outperform those trained on prior datasets. Introduction of HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation from 400 submissions in 2021 to 1022 submissions in 2022. Comment: 14 Pages, 10 Figures, 5 Tables

    A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

    Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development: both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future. Comment: To appear in Neural Networks
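
    To make concrete the kind of quantity such a metric suite measures, the sketch below computes two common continual-learning summaries from a per-task performance matrix; these simplified definitions (retention relative to peak performance, forward transfer against a zero baseline) are generic illustrations, not the specific metrics proposed in the paper.

```python
import numpy as np

def lifelong_metrics(perf):
    """Summaries from an (N, N) matrix where perf[i, j] is the score on task j
    after the system has finished learning task i (tasks learned in order 0..N-1)."""
    n = perf.shape[0]
    final = perf[-1]                    # scores on every task after all training
    best = perf.max(axis=0)             # peak score ever reached on each task
    # Maintenance: how much earlier-task performance is retained at the end (<= 0).
    maintenance = float(np.mean(final[:-1] - best[:-1]))
    # Forward transfer: score on each new task just before it is trained on.
    forward_transfer = float(np.mean([perf[j - 1, j] for j in range(1, n)]))
    return {"maintenance": maintenance, "forward_transfer": forward_transfer}
```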

    Hybrid EMD-RF Model for Predicting Annual Rainfall in Kerala, India

    Rainfall forecasting is critical for the economy, but it has proven difficult due to the uncertainties, complexities, and interdependencies that exist in climatic systems. An efficient rainfall forecasting model will be beneficial in implementing suitable measures against natural disasters such as floods and landslides. In this paper, a novel hybrid model of empirical mode decomposition (EMD) and random forest (RF) was developed to enhance the accuracy of annual rainfall prediction. The EMD technique was utilized to decompose the rainfall signal into six intrinsic mode functions (IMFs) to extract underlying patterns, while the RF algorithm was employed to make predictions based on the IMFs. The hybrid RF–IMF model was trained and tested using a dataset of annual rainfall in Kerala from 1871 to 2020, and its performance was compared to traditional models such as RF regression and the autoregressive moving average (ARMA) model. Mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination or R-squared (R2) were used to compare the performances of these three models. Model evaluation metrics show that the RF–IMF model outperformed both the RF model and the ARMA model.
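
    A minimal sketch of the decompose-then-predict pipeline, assuming the PyEMD and scikit-learn packages; using current-year IMF values as features for next-year rainfall, the simple chronological split, and the forest size are simplifications rather than the paper's exact setup.

```python
import numpy as np
from PyEMD import EMD
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def emd_rf_forecast(rainfall, test_years=30):
    """rainfall: 1-D array of annual totals (e.g. Kerala, 1871-2020)."""
    rainfall = np.asarray(rainfall, dtype=float)
    imfs = EMD().emd(rainfall)          # (n_imfs, n_years) intrinsic mode functions
    X = imfs.T[:-1]                     # IMF values in year t ...
    y = rainfall[1:]                    # ... used to predict rainfall in year t + 1
    X_tr, X_te = X[:-test_years], X[-test_years:]
    y_tr, y_te = y[:-test_years], y[-test_years:]

    model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": mean_squared_error(y_te, pred) ** 0.5,
        "R2": r2_score(y_te, pred),
    }
```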

    Micropropagation prospective of cotyledonary explants of Decalepis hamiltonii Wight & Arn.—An endangered edible species

    The study was undertaken to standardize the development of callus, shoot and root regeneration from cotyledonary explants of Decalepis hamiltonii Wight & Arn. through tissue culture techniques. The MS medium supplemented with 6-benzyl amino purine (BA), 2,4-dichlorophenoxy acetic acid (2,4-D), kinetin (Kn), gibberellic acid (GA3), indole acetic acid (IAA), indole butyric acid (IBA) and 1-naphthalene acetic acid (NAA) was used for callus, shoot and root regeneration. The maximum percentage (82.0%) of callus formation was achieved on 0.5 mg/L BA in combination with 0.05 mg/L Kn, followed by 78.5% callus formation on 0.5 mg/L 2,4-D fortified with 0.05 mg/L Kn. The highest shoot proliferation (4.6 shoots/callus) and shoot length (6.9 cm) were achieved on 1.0 mg/L BA combined with 0.1 mg/L GA3, followed by 3.8 shoots per callus and 5.8 cm shoot length on 1.0 mg/L IAA combined with 0.1 mg/L GA3. The highest root formation (38.2 roots/shoot) and root length (11.8 cm) were achieved on ½ strength MS medium fortified with 0.4 mg/L IBA, followed by 36.5 roots per shoot and a root length of 10.7 cm on 0.4 mg/L NAA. The well-developed rooted plantlets were hardened in a mixture of forest soil, soil and vermiculite (1:1:1), and 97.5% of the plantlets survived after hardening.

    Comparison of patient and graft survival in tacrolimus versus cyclosporine-based immunosuppressive regimes in renal transplant recipients – Single-center experience from South India

    Studies have shown better graft function and reduced acute rejection rates among renal transplant recipients who were on tacrolimus (Tac)-based immunosuppression regimens as compared to cyclosporine (CsA)-based regimens in the first year. However, long-term follow-up data did not reveal better outcomes with the Tac-based regimens. In view of the short-term benefits, the trend of late has been to change to Tac-based regimens. Data from the Indian subcontinent are, however, sparse. We therefore looked at our data to ascertain whether a Tac-based regimen does have better outcomes in our population. We studied a total of 108 individuals who underwent renal transplantation between January 2007 and June 2013, with a mean follow-up of 38.22 months (comparable between the two groups). In our group, males constituted 77.8%, and 16.7% of the 108 individuals were diabetic. New-onset diabetes after renal transplantation was more common in the Tac group (21 vs. 12), and the difference was statistically significant (P = 0.03). At the last follow-up, serum creatinine was higher in the CsA group (1.77 mg/dl vs. 1.35 mg/dl), and the difference was statistically significant (P = 0.03). The number of individuals requiring hemodialysis was also significantly higher in the CsA group (9 vs. 2; P = 0.05). Patient survival was similar in both groups at the 1-year and 5-year follow-up; however, graft survival was better in the Tac group as compared to the CsA group (0.94 vs. 0.88 at 1 year and 0.85 vs. 0.72 at 5 years).